nvme: downgrade WARN in nvme_setup_rw to pr_debug (#772)

blktests-ci[bot] wants to merge 2 commits into linus-master_base
When an NVMe namespace is configured with embedded metadata (flbas bit 4 set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires WARN_ON_ONCE on any request that reaches it with REQ_INTEGRITY unset. The WARN was observed repeatedly during NVMe fuzz testing with a FEMU-based fuzzer that performs semantic mutation of Identify Namespace responses.

The trigger requires three conditions to align:

(a) a namespace transitions through the EXT_LBAS non-PI state (head->ms != 0, features & NVME_NS_EXT_LBAS, !(features & NVME_NS_METADATA_SUPPORTED));

(b) nvme_init_integrity() returns false through the early-exit branch at core.c:1834 without populating bi->metadata_size, leaving the disk without an integrity profile (blk_get_integrity() returns NULL);

(c) a request that was admitted to the block layer before the namespace update reaches nvme_setup_rw() after it.

The admission gap arises in two places. First, the plug-list flush path: a process with dirty pages queued in a plug before the namespace update flushes them on file close (blk_finish_plug -> blk_mq_dispatch -> nvme_setup_rw), bypassing any capacity-zero gate. Second, the cached-rq path: blk_mq_submit_bio() at blk-mq.c:3155 may find a cached request; if so, the bio_queue_enter() freeze-serialization guard at blk-mq.c:3174-3176 is skipped and the bio is dispatched immediately.

In both cases the bio was submitted without REQ_INTEGRITY (blk_get_integrity() returned NULL at dispatch time, so bio_integrity_action() returned 0 and bio_integrity_prep() was never called), and it reaches nvme_setup_rw() for a namespace where head->ms != 0. The existing BLK_STS_NOTSUPP return correctly handles this dispatch; the WARN_ON_ONCE is a false positive. The WARN was reproduced six times over four days of fuzzing (April 2026).
A representative crash shows the plug-flush path:

  nvme0n1: detected capacity change from 2097152 to 0
  WARNING: drivers/nvme/host/core.c:1042 at nvme_setup_rw+0x768/0xfd0
  PID: 785 (systemd-udevd)
  Call Trace:
   nvme_setup_cmd
   nvme_queue_rq
   blk_mq_dispatch_rq_list
   blk_mq_flush_plug_list
   blk_finish_plug
   blkdev_writepages
   sync_blockdev
   bdev_release
   __fput
   sys_close

Replace WARN_ON_ONCE with pr_debug_ratelimited so the condition is logged at debug level without a splat. The BLK_STS_NOTSUPP return is preserved; I/O to the transitioning namespace is still rejected.

An alternative approach that addresses the root cause at the integrity-profile level is proposed in patch 2/2: populate bi->metadata_size for EXT_LBAS non-PI namespaces in nvme_init_integrity() so that bio_integrity_action() returns non-zero, bio_integrity_prep() sets REQ_INTEGRITY, and nvme_setup_rw() never reaches this branch. Both patches are sent as RFC for maintainer guidance on the preferred direction.

Tested: compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST). Boot-tested under FEMU with NVME_MALICIOUS_RESPONDER=1 NVME_SEMANTIC_DATA_MUTATOR=1; ran 4 concurrent dd processes plus 500 rescan_controller cycles. No WARN, BUG, or Oops observed.

Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
This patch is an alternative to patch 1/2: instead of downgrading the assertion in nvme_setup_rw(), it addresses the root cause at the integrity-profile level so that the assertion is never reached.

For PCIe namespaces with extended LBAs (NVME_NS_EXT_LBAS set, flbas bit 4) but without PI and without NVME_NS_METADATA_SUPPORTED, the early-exit branch of nvme_init_integrity() at core.c:1834 returns false without populating bi->metadata_size. As a result blk_get_integrity() returns NULL (it checks q->limits.integrity.metadata_size via blk_integrity_queue_supports_integrity()), bio_integrity_action() returns 0, bio_integrity_prep() is never called, and REQ_INTEGRITY is never set on bios dispatched to the namespace. Any such bio that reaches nvme_setup_rw() triggers WARN_ON_ONCE because head->ms != 0 but blk_integrity_rq() returns false.

Populate bi->metadata_size = head->ms in the early-exit path for the EXT_LBAS non-PI case. This is sufficient to make blk_get_integrity() return non-NULL, which causes bio_integrity_action() to return non-zero, which causes bio_integrity_prep() to run and set REQ_INTEGRITY on any bio submitted to the namespace. Requests that reach nvme_setup_rw() then satisfy blk_integrity_rq() and the assertion is not reached.

blk_validate_integrity_limits() accepts this configuration: with csum_type=BLK_INTEGRITY_CSUM_NONE, pi_tuple_size=0, and pi_offset=0, all checks pass (pi_offset + pi_tuple_size <= metadata_size; pi_tuple_size must be 0 for CSUM_NONE), and interval_exp is auto-filled to ilog2(logical_block_size). No generate/verify callbacks are configured, so no actual integrity computation occurs; only the blk_integrity_rq() predicate is satisfied. Capacity is still forced to 0 by set_capacity_and_notify(), so new bios are rejected by bio_check_eod() before queue entry.

Tested: compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_SEMANTIC_DATA_MUTATOR=1; ran 4 concurrent dd processes plus 500 rescan_controller cycles with no WARN, BUG, or Oops.

The EXT_LBAS + ms != 0 + !PI combination was not triggered during testing (FEMU's mutator varies flbas and lbaf[0].ms independently; flbas=0x10 with lbaf_idx=0 was not produced in this run). The bi->metadata_size assignment path was therefore not exercised; correctness of blk_validate_integrity_limits() for this configuration was verified by code inspection. Provided as RFC.

Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
Upstream branch: aa54b1d
Pull request for series with
subject: nvme: downgrade WARN in nvme_setup_rw to pr_debug
version: 1
url: https://patchwork.kernel.org/project/linux-block/list/?series=1085752